source('../../workflow/resources/annotateVariants.R')
sampleName <- 'Br26'
inputFolder <- '/cluster/work/bewi/members/jgawron/projects/CTC/input_folder'
annotations <- annotate_variants(sampleName, inputFolder)
For each cluster (defined by color), we computed a pairwise distance for each mutation pair that indicates how often the two mutations occur in the same private branch of cells from the cluster:
dist(M1, M2) = 0 (for M1 = M2)
dist(M1,M2) = 1 - (%samples where M1 and M2 are both in the same private branch of a cell from the cluster) (elsewise)
A private branch is defined as the path from a leaf to the node just below the LCA of this leaf to another leaf from the same cluster.
This is a generalization of the earlier method to find the top seperating mutations of pairs of leafs. The generalization was necessary to handle the larger clusters that were broken in more than 2 pieces.
clusterName <- "lightcoral"
d <- read.table(file.path(inputFolder, sampleName, paste0(sampleName, "_postSampling_", clusterName, ".txt")), header = TRUE, sep = "\t", stringsAsFactors = F, row.names = 1)
mat <- as.matrix(d)
mat[1:4, 1:4]
## chr11_78017064 chrX_141003335 chr15_33864037 chrX_77691570
## chr11_78017064 0.000000 0.998925 0.999075 0.999250
## chrX_141003335 0.998925 0.000000 0.852025 0.863275
## chr15_33864037 0.999075 0.852025 0.000000 0.844025
## chrX_77691570 0.999250 0.863275 0.844025 0.000000
For each position, we computed the percentage of samples that have a coverage of at least 3 at this position. This is meant as a simple score of the data quality of a position that can be used in addition to the separation score to pick mutations for the wet lab experiments. Furthermore, we added simple functional annotations to the variants.
coverage <- read.table(file.path(inputFolder, sampleName, paste(sampleName, "covScore.txt", sep = "_")), header = TRUE, sep = "\t", stringsAsFactors = F, row.names = 1)
coverage$variantName <- rownames(coverage)
head(coverage)
## covScore variantName
## chr11_78017064 0.8333333 chr11_78017064
## chrX_141003335 0.6666667 chrX_141003335
## chr15_33864037 0.5000000 chr15_33864037
## chrX_77691570 0.6666667 chrX_77691570
## chr18_51062723 0.5000000 chr18_51062723
## chrX_141003560 0.6666667 chrX_141003560
coverage <- inner_join(coverage, annotations, by = "variantName")
###Overview To get an overview, we plot the full distance matrix:
library(heatmaply)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
##
## ======================
## Welcome to heatmaply version 1.5.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags:
## https://stackoverflow.com/questions/tagged/heatmaply
## ======================
heatmaply(mat)
mat2 <- mat
diag(mat2) <- 1
min_dist <- apply(mat2, 1, min) # find minimum distance to other mutations
selected_muts <- which(min_dist < 1) # select those below 0.5 say
mat3 <- mat[selected_muts, selected_muts]
This is what the distance matrix looks like now:
heatmaply(mat3)
To cluster mutations, we create a dendrogram based on the pairwise distances:
mat <- mat3
d_mat <- as.dist(mat)
hc <- hclust(d_mat, "average") ## hierarchical clustering of mutations based on distance matrix
par(cex = 0.6)
plot(hc, main = "Dendrogram based on average pairwise distance", sub = "", xlab = "Separating mutations")
The case seems to be a bit clearer in this case: There is one single mutation that never cooccurs with any of the other mutations.
coverage %>% filter(variantName %in% colnames(mat2))
## covScore variantName REF ALT relevant
## 1 0.8333333 chr11_78017064 G A NONE
## 2 0.6666667 chrX_141003335 T C NONE
## 3 0.5000000 chr15_33864037 G A NONE
## 4 0.6666667 chrX_77691570 A T NONE
## 5 0.5000000 chr18_51062723 T C NONE
## 6 0.6666667 chrX_141003560 C G MODERATE
## 7 0.6666667 chr19_54595789 C G MODERATE
## 8 0.5000000 chr19_54576073 T A MODERATE
## 9 0.5000000 chr9_33795599 T C MODERATE
## 10 0.5000000 chr17_12111084 G T NONE
## 11 0.5000000 chr7_142760456 C G HIGH
However, the coverage score of chr11_78017064 sticks out. Could it be that this variant is found to be more distant than the others because we can determine its genotype with more confidence?
We define a cut point to get distinct branches. These should roughly represent the mutations on the private branches in the posterior sampling. The number of clusters can either be the number of distinct branches in the posterior sampling or decided based on the hierarchical clustering.
par(cex = 0.6)
plot(hc, main = "Dendrogram based on average pairwise distance", sub = "", xlab = "Separating mutations")
abline(h = 0.98, lwd = 2, lty = 2, col = "green")
# geneGroups <- cutree(hc, k = NULL, h = 0.6)
geneGroups <- cutree(hc, k = 2)
Here we decide to cut the data into three clusters eventhough the MAP trees suggest that only two of the four cells from the CTC-cluster form distinct branches. In the MAP trees found for this case, the left-most branch corresponds to the mutations in the smaller private branch. The other two branches correspond to the larger private branch with the middle branch roughly consisting of the mutations lower down in the private branch. The right-most branch consists mostly of the mutations higher up in the branch, but also some mutations in the joint trunk.
To rank the mutations within one cluster, we reduce the distance matrix to the cluster mutations and rank them by their average distance to the other mutations in the cluster.
cluster1 <- names(geneGroups)[geneGroups == 1]
Mutations in cluster:
## [1] "chr11_78017064"
This cluster only consists of one variant, so there is no need to rank the variants.
Technically we should exclude the value on the main diagonal from the average. But it is zero at all positions which means when only ordering the mutations based on the average, it makes no difference.
Top separating mutations (first branch):
top_df <- as.data.frame(0)
colnames(top_df) <- "avgDist"
top_df$variantName <- cluster1
top_muts_1 <- inner_join(top_df, coverage, by = "variantName")
# colnames(top_muts_1)[1] <- "mutation"
top_muts_1
## avgDist variantName covScore REF ALT relevant
## 1 0 chr11_78017064 0.8333333 G A NONE
print(sprintf("Number of mutations in cluster %s with moderate functional impact: %d", clusterName, sum(top_muts_1 == "MODERATE")))
## [1] "Number of mutations in cluster lightcoral with moderate functional impact: 0"
print(sprintf("Number of mutations in cluster %s with high functional impact: %d", clusterName, sum(top_muts_1 == "HIGH")))
## [1] "Number of mutations in cluster lightcoral with high functional impact: 0"